62 research outputs found
EUSMT: incorporating linguistic information to SMT for a morphologically rich language. Its use in SMT-RBMT-EBMT hybridation
148 p.: graf.This thesis is defined in the framework of machine translation for Basque. Having developed a Rule-Based Machine Translation (RBMT) system for Basque in the IXA group (Mayor, 2007), we decided to tackle the Statistical Machine Translation (SMT) approach and experiment on how we could adapt it to the peculiarities of the Basque language.
First, we analyzed the impact of the agglutinative nature of Basque and the best way to deal with it. In order to deal with the problems presented above, we have split up Basque words into the lemma and some tags which represent the morphological information expressed by the inflection. By dividing each Basque word in this way, we aim to reduce the sparseness produced by the agglutinative nature of Basque and the small amount of training data.
Similarly, we also studied the differences in word order between Spanish and Basque, examining different techniques for dealing with them. we confirm the weakness of the basic SMT in dealing with great word order differences in the source and target languages. Distance-based reordering, which is the technique used by the baseline system, does not have enough information to properly handle great word order differences, so any of the techniques tested in this work (based on both statistics and manually generated rules) outperforms the baseline.
Once we had obtained a more accurate SMT system, we started the first attempts to combine different MT systems into a hybrid one that would allow us to get the best of the different paradigms. The hybridization attempts carried out in this PhD dissertation are preliminaries, but, even so, this work can help us to determine the ongoing steps.
This thesis is defined in the framework of machine translation for Basque. Having developed a Rule-Based Machine Translation (RBMT) system for Basque in the IXA group (Mayor, 2007), we decided to tackle the Statistical Machine Translation (SMT) approach and experiment on how we could adapt it to the peculiarities of the Basque language.
First, we analyzed the impact of the agglutinative nature of Basque and the best way to deal with it. In order to deal with the problems presented above, we have split up Basque words into the lemma and some tags which represent the morphological information expressed by the inflection. By dividing each Basque word in this way, we aim to reduce the sparseness produced by the agglutinative nature of Basque and the small amount of training data.
Similarly, we also studied the differences in word order between Spanish and Basque, examining different techniques for dealing with them. we confirm the weakness of the basic SMT in dealing with great word order differences in the source and target languages. Distance-based reordering, which is the technique used by the baseline system, does not have enough information to properly handle great word order differences, so any of the techniques tested in this work (based on both statistics and manually generated rules) outperforms the baseline.
Once we had obtained a more accurate SMT system, we started the first attempts to combine different MT systems into a hybrid one that would allow us to get the best of the different paradigms. The hybridization attempts carried out in this PhD dissertation are preliminaries, but, even so, this work can help us to determine the ongoing steps.Eusko Jaurlaritzaren ikertzaileak prestatzeko beka batekin (BFI05.326)eginda
Comparing rule-based and data-driven approaches to Spanish-to-Basque machine translation
In this paper, we compare the rule-based and data-driven
approaches in the context of Spanish-to-Basque Machine Translation. The rule-based system we consider has been developed specifically for Spanish-to-Basque machine translation, and is tuned to this language pair. On the contrary, the data-driven system we use is generic, and has not been specifically designed to deal with Basque. Spanish-to-Basque Machine Translation is a challenge for data-driven
approaches for at least two reasons. First, there is lack of
bilingual data on which a data-driven MT system can be trained. Second, Basque is a morphologically-rich agglutinative language and translating to Basque requires a huge generation of morphological information, a difficult task for a generic system not specifically tuned to Basque. We present the results of a series of experiments, obtained on two different corpora, one being “in-domain” and the
other one “out-of-domain” with respect to the data-driven
system. We show that n-gram based automatic evaluation and edit-distance-based human evaluation yield two different sets of results. According to BLEU, the data-driven system outperforms the rule-based system on the in-domain data, while according to the human evaluation, the rule-based
approach achieves higher scores for both corpora
A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings
Recent work has managed to learn cross-lingual word embeddings without
parallel data by mapping monolingual embeddings to a shared space through
adversarial training. However, their evaluation has focused on favorable
conditions, using comparable corpora or closely-related languages, and we show
that they often fail in more realistic scenarios. This work proposes an
alternative approach based on a fully unsupervised initialization that
explicitly exploits the structural similarity of the embeddings, and a robust
self-learning algorithm that iteratively improves this solution. Our method
succeeds in all tested scenarios and obtains the best published results in
standard datasets, even surpassing previous supervised systems. Our
implementation is released as an open source project at
https://github.com/artetxem/vecmapComment: ACL 201
An Effective Approach to Unsupervised Machine Translation
While machine translation has traditionally relied on large amounts of
parallel corpora, a recent research line has managed to train both Neural
Machine Translation (NMT) and Statistical Machine Translation (SMT) systems
using monolingual corpora only. In this paper, we identify and address several
deficiencies of existing unsupervised SMT approaches by exploiting subword
information, developing a theoretically well founded unsupervised tuning
method, and incorporating a joint refinement procedure. Moreover, we use our
improved SMT system to initialize a dual NMT model, which is further fine-tuned
through on-the-fly back-translation. Together, we obtain large improvements
over the previous state-of-the-art in unsupervised machine translation. For
instance, we get 22.5 BLEU points in English-to-German WMT 2014, 5.5 points
more than the previous best unsupervised system, and 0.5 points more than the
(supervised) shared task winner back in 2014.Comment: ACL 201
Bilingual Lexicon Induction through Unsupervised Machine Translation
A recent research line has obtained strong results on bilingual lexicon
induction by aligning independently trained word embeddings in two languages
and using the resulting cross-lingual embeddings to induce word translation
pairs through nearest neighbor or related retrieval methods. In this paper, we
propose an alternative approach to this problem that builds on the recent work
on unsupervised machine translation. This way, instead of directly inducing a
bilingual lexicon from cross-lingual embeddings, we use them to build a
phrase-table, combine it with a language model, and use the resulting machine
translation system to generate a synthetic parallel corpus, from which we
extract the bilingual lexicon using statistical word alignment techniques. As
such, our method can work with any word embedding and cross-lingual mapping
technique, and it does not require any additional resource besides the
monolingual corpus used to train the embeddings. When evaluated on the exact
same cross-lingual embeddings, our proposed method obtains an average
improvement of 6 accuracy points over nearest neighbor and 4 points over CSLS
retrieval, establishing a new state-of-the-art in the standard MUSE dataset.Comment: ACL 201
Translation Artifacts in Cross-lingual Transfer Learning
Both human and machine translation play a central role in cross-lingual
transfer learning: many multilingual datasets have been created through
professional translation services, and using machine translation to translate
either the test set or the training set is a widely used transfer technique. In
this paper, we show that such translation process can introduce subtle
artifacts that have a notable impact in existing cross-lingual models. For
instance, in natural language inference, translating the premise and the
hypothesis independently can reduce the lexical overlap between them, which
current models are highly sensitive to. We show that some previous findings in
cross-lingual transfer learning need to be reconsidered in the light of this
phenomenon. Based on the gained insights, we also improve the state-of-the-art
in XNLI for the translate-test and zero-shot approaches by 4.3 and 2.8 points,
respectively.Comment: EMNLP 202
- …